Project 1: SAT & ACT Analysis

Problem Statement

Analyse SAT and ACT Scores from previous years to provide Recommendations on where money can be spent to increase SAT participation rates.

Import Libraries

If you combine your problem statement, executive summary, data dictionary, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project. Don't forget to cite your data sources!

All libraries used should be added here

In [2]:

Data Import and Cleaning: 2017

1. Read In SAT & ACT Data

Read in the sat_2017.csv and act_2017.csv files and assign them to appropriately named pandas dataframes.

In [4]:
In [5]:

2. Display Data

Print the first 10 rows of each dataframe to your jupyter notebook

In [6]:
Out[6]:
State Participation Evidence-Based Reading and Writing Math Total
0 Alabama 5% 593 572 1165
1 Alaska 38% 547 533 1080
2 Arizona 30% 563 553 1116
3 Arkansas 3% 614 594 1208
4 California 53% 531 524 1055
5 Colorado 11% 606 595 1201
6 Connecticut 100% 530 512 1041
7 Delaware 100% 503 492 996
8 District of Columbia 100% 482 468 950
9 Florida 83% 520 497 1017
In [8]:
Out[8]:
State Participation English Math Reading Science Composite
0 National 60% 20.3 20.7 21.4 21.0 21.0
1 Alabama 100% 18.9 18.4 19.7 19.4 19.2
2 Alaska 65% 18.7 19.8 20.4 19.9 19.8
3 Arizona 62% 18.6 19.8 20.1 19.8 19.7
4 Arkansas 100% 18.9 19.0 19.7 19.5 19.4
5 California 31% 22.5 22.7 23.1 22.2 22.8
6 Colorado 100% 20.1 20.3 21.2 20.9 20.8
7 Connecticut 31% 25.5 24.6 25.6 24.6 25.2
8 Delaware 18% 24.1 23.4 24.8 23.6 24.1
9 District of Columbia 32% 24.4 23.5 24.9 23.5 24.2

3. Verbally Describe Data

Take your time looking through the data and thoroughly describe the data in the markdown cell below.

Answer:

The tables above displays US state-wise information on the SAT and ACT tests conducted for the year of 2017. The data comprises of the following,

  1. SAT (2017) - Math, ERW section scores being marked out of 800 (each), along with Total Scores, marked out of 1600 and finally the Participation Rate in percentages.

  2. ACT (2017) - English, Math, Reading and Science Section scores being marked out of 36, along with Composite Scores for all 4 sections marked out of 36, and the Participation Rates in percentage for the same year.

Incosistent Datatype - Important thing to consider here is total SAT scores are marked in integer values while the ACT score are floating point numbers.

Few Columns and Cells are named inconsistently. Ex. Last value in ACT - Composite Column has an 'x' while it should be all numeric type.

Lastly, I notice the ACT data has national average scores as first index while SAT seems to not provide that information.

4a. Does the data look complete?

Answer:

No missing or null values for both dataframes. We will need to convert the dtype for participation and composite columns to int/float values.

In [ ]:
In [9]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   State                               51 non-null     object
 1   Participation                       51 non-null     object
 2   Evidence-Based Reading and Writing  51 non-null     int64 
 3   Math                                51 non-null     int64 
 4   Total                               51 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 2.1+ KB
In [10]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   State          52 non-null     object 
 1   Participation  52 non-null     object 
 2   English        52 non-null     float64
 3   Math           52 non-null     float64
 4   Reading        52 non-null     float64
 5   Science        52 non-null     float64
 6   Composite      52 non-null     object 
dtypes: float64(4), object(3)
memory usage: 3.0+ KB

4b. Are there any obvious issues with the observations?

What is the minimum possible value for each test/subtest? What is the maximum possible value?

Consider comparing any questionable values to the sources of your data:

Answer:

  • SAT - Math section has a minimim score if 52, we will consider this as an outlier of our dataset.
  • ACT - Science section has a minimum score of 2.3, we will consider this as an outlier of our dataset.
In [300]:
Out[300]:
states                            Alabama
erw_sat17                             482
math_sat17                             52
total_sat17                           950
participation_percentage_sat17          2
dtype: object
In [301]:
Out[301]:
states                            Alabama
english_act17                        16.3
math_act17                             18
reading_act17                        18.1
science_act17                         2.3
composite_act17                      17.8
participation_percentage_act17          8
dtype: object
In [ ]:

4c. Fix any errors you identified

The data is available so there's no need to guess or calculate anything. If you didn't find any errors, continue to the next step.

Answer:

  • The last index for ACT, under composite column the value contains additional characters that will be an obstacle later on. Lets change that.
In [12]:
Out[12]:
State Participation Evidence-Based Reading and Writing Math Total
0 Alabama 5% 593 572 1165
1 Alaska 38% 547 533 1080
2 Arizona 30% 563 553 1116
In [16]:
Out[16]:
State Participation English Math Reading Science Composite
49 West Virginia 69% 20.0 19.4 21.2 20.5 20.4
50 Wisconsin 100% 19.7 20.4 20.6 20.9 20.5
51 Wyoming 100% 19.4 19.8 20.8 20.6 20.2x
In [17]:
In [18]:
Out[18]:
'20.2x'
In [19]:
Out[19]:
State Participation English Math Reading Science Composite
50 Wisconsin 100% 19.7 20.4 20.6 20.9 20.5
51 Wyoming 100% 19.4 19.8 20.8 20.6 20.2

5. What are your data types?

Display the data types of each feature.

In [20]:
Out[20]:
State                                 object
Participation                         object
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object
In [21]:
Out[21]:
State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite         object
dtype: object

What did you learn?

  • Do any of them seem odd?
  • Which ones are not as they should be?

Answer:

  • Yes. The 'Participation' & 'Composite' columns might have some missing values, they should be numeric.
  • Also, looks like all other column can be either integers/floats except the 'State' column, which can be of object datatype.

6. Fix Incorrect Data Types

Based on what you discovered above, use appropriate methods to re-type incorrectly typed data.

  • Define a function that will allow you to convert participation rates to an appropriate numeric type. Use map or apply to change these columns in each dataframe.
In [23]:
  • Fix any individual values preventing other columns from being the appropriate type.
In [24]:
In [25]:
In [28]:
  • Finish your data modifications by making sure the columns are now typed appropriately.
In [26]:
  • Display the data types again to confirm they are correct.
In [29]:
Out[29]:
State                                  object
Evidence-Based Reading and Writing    float64
Math                                  float64
Total                                 float64
Participation_%                       float64
dtype: object
In [30]:
Out[30]:
State               object
English            float64
Math               float64
Reading            float64
Science            float64
Composite          float64
Participation_%    float64
dtype: object

7. Rename Columns

Change the names of the columns to more expressive names so that you can tell the difference the SAT columns and the ACT columns. Your solution should map all column names being changed at once (no repeated singular name-changes). We will be combining these data with some of the data from 2018 and 2019, and so you should name columns in an appropriate way.

Guidelines:

  • Column names should be all lowercase (you will thank yourself when you start pushing data to SQL later in the course)
  • Column names should not contain spaces (underscores will suffice--this allows for using the df.column_name method to access columns in addition to df['column_name'].
  • Column names should be unique and informative (the only feature that we actually share between dataframes is the state).
In [32]:
Out[32]:
Index(['State', 'Evidence-Based Reading and Writing', 'Math', 'Total',
       'Participation_%'],
      dtype='object')
In [33]:
In [34]:
In [35]:
In [36]:
In [37]:
Out[37]:
Index(['state_sat17', 'erw_sat17', 'math_sat17', 'total_sat17',
       'participation_percentage_sat17'],
      dtype='object')
In [38]:
Out[38]:
Index(['state_act17', 'english_act17', 'math_act17', 'reading_act17',
       'science_act17', 'composite_act17', 'participation_percentage_act17'],
      dtype='object')

8. Create a data dictionary

Now that we've fixed our data, and given it appropriate names, let's create a data dictionary.

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry:

Feature Type Dataset Description
county_pop integer 2010 census The population of the county (units in thousands, where 2.5 represents 2500 people).
per_poverty float 2010 census The percent of the county over the age of 18 living below the 200% of official US poverty rate (units percent to two decimal places 98.10 means 98.1%)

Here's a quick link to a short guide for formatting markdown in Jupyter notebooks.

Provided is the skeleton for formatting a markdown table, with columns headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. This would be a great thing to copy and paste into your custom README for this project.

Feature Type Dataset Description
column name int/float/object ACT/SAT This is an example

Data Dictionary

SAT Scores (Scholastic Assessment Test)
Feature Type Dataset Description
states (PK) object SAT Names of different US States ordered
alphabetically in ascending order
erw object SAT Section score for Evidence based
Reading & Writing, marked
between 200 & 800
math object SAT Section score for Math,
marked between 200 & 800
total object SAT Sum of both the Section scores,
ranges from 400 to 1600
participation_
percentage
float SAT Percentage (%) of students from
each State that took the test
more on SAT scores here

 

ACT Scores (American College Testing)
Feature Type Dataset Description
states (PK) object ACT Names of different US States ordered
alphabetically in ascending order
english float ACT sub-core for English section,
ranges from 1 to 36
math float ACT sub-core for Math section,
ranges from 1 to 36
reading float ACT sub-core for Reading section,
ranges from 1 to 36
science float ACT sub-core for Science section,
ranges from 1 to 36
composite/
total
float ACT Mean (Average) score based on English,
Math, Reading and Science subscores,
also ranges from 1 to 36
participation_
percentage
float ACT Percentage (%) of students from
each State that took the test
more on ACT scores here

*NOTE: Suffix: Ex. _act17 or _sat17 in dataframes indicates features from ACT or SAT Test 2017
*(PK) Primary Key column

9. Drop unnecessary rows

One of our dataframes contains an extra row. Identify and remove this from the dataframe.

Answer:

  • ACT has 1 additional row ('National) compared to SAT.
  • In order to merge on columns later, we will need to have equal number of rows for all data frames. So the best solution is to drop that row.
In [40]:
Out[40]:
state_act17 english_act17 math_act17 reading_act17 science_act17 composite_act17 participation_percentage_act17
0 National 20.3 20.7 21.4 21.0 21.0 60.0
1 Alabama 18.9 18.4 19.7 19.4 19.2 100.0
2 Alaska 18.7 19.8 20.4 19.9 19.8 65.0
In [41]:
In [42]:
In [43]:
In [45]:
Out[45]:
(51, 7)
In [47]:
Out[47]:
state_act17 english_act17 math_act17 reading_act17 science_act17 composite_act17 participation_percentage_act17
0 Alabama 18.9 18.4 19.7 19.4 19.2 100.0
1 Alaska 18.7 19.8 20.4 19.9 19.8 65.0
2 Arizona 18.6 19.8 20.1 19.8 19.7 62.0

10. Merge Dataframes

Join the 2017 ACT and SAT dataframes using the state in each dataframe as the key. Assign this to a new variable.

In [48]:
In [52]:
In [57]:
Out[57]:
states erw_sat17 math_sat17 total_sat17 participation_percentage_sat17 english_act17 math_act17 reading_act17 science_act17 composite_act17 participation_percentage_act17
0 Alabama 593.0 572.0 1165.0 5.0 18.9 18.4 19.7 19.4 19.2 100.0
1 Alaska 547.0 533.0 1080.0 38.0 18.7 19.8 20.4 19.9 19.8 65.0
2 Arizona 563.0 553.0 1116.0 30.0 18.6 19.8 20.1 19.8 19.7 62.0
In [58]:
Out[58]:
(51, 11)

11. Save your cleaned, merged dataframe

Use a relative path to save out your data as combined_2017.csv.

In [60]:

Data Import and Cleaning: 2018 and 2019

Read in the sat_2018.csv, sat_2019.csv, act_2018.csv and act_2019.csv files and assign them to appropriately named pandas dataframes. For the 2018-2019 ACT Data, only the Composite scores are available. Repeat the same processes to clean the 2018-2019 data here as you were instructed in the previous sections above.

In [61]:
In [62]:
In [63]:
In [64]:
In [65]:

Lets Clean SAT - 2019 dataframe first,

In [66]:
Out[66]:
State EBRW Math Total Participation_%
43 Tennessee 618 602 1220 7
44 Texas 515 507 1022 68
45 Utah 614 615 1230 4
46 Vermont 560 546 1106 66
47 Virgin Islands 490 445 935 Ñ
48 Virginia 567 551 1119 68
49 Washington 539 535 1074 70
50 West Virginia 483 460 943 99
51 Wisconsin 635 648 1283 3
52 Wyoming 623 615 1238 3
In [68]:
Out[68]:
array(['7', '41', '31', '6', '63', '100', '94', '71', '54', '66', '3',
       '4', '5', '99', '82', '81', '9', '20', '95', '18', '79', '51', '2',
       '19', '22', '70', 'Ñ', '68'], dtype=object)
In [67]:
Out[67]:
State Participation_%
39 Puerto Rico Ñ
47 Virgin Islands Ñ
In [69]:
Out[69]:
array(['100', '38', '73', '23', '27', '22', '13', '32', '54', '49', '80',
       '31', '35', '29', '66', '72', '6', '28', '21', '19', '95', '82',
       '14', '25', '63', '96', '42', '17', '12', '78', '75', '39', '20',
       '24', '52'], dtype=object)
In [ ]:
In [70]:
In [71]:
In [72]:
In [73]:
Out[73]:
(51, 5)

Clean ACT - 2019 dataframe

In [74]:
Out[74]:
State Composite Participation_%
42 Tennessee 19.4 100
43 Texas 20.5 39
44 Utah 20.3 100
45 Vermont 24.1 20
46 Virginia 24.0 21
47 Washington 22.1 24
48 West Virginia 20.8 49
49 Wisconsin 20.3 100
50 Wyoming 19.8 100
51 National 20.7 52
In [75]:
In [76]:
Out[76]:
(51, 3)

Clean SAT & ACT - 2018 dataframes

In [78]:
Out[78]:
State Composite Participation_%
15 Iowa 21.8 68
16 Kansas 21.6 71
17 Kentucky 20.2 100
18 Louisiana 19.2 100
19 Maine 24.0 7
20 Maine 24.0 7
21 Maryland 22.5 31
22 Massachusetts 25.5 25
23 Michigan 24.2 22
24 Minnesota 21.3 99
In [79]:
In [81]:
Out[81]:
(51, 3)
In [82]:
Out[82]:
State Composite Participation_%
0 Alabama 19.1 100
1 Alaska 20.8 33
2 Arizona 19.2 66
3 Arkansas 19.4 100
4 California 22.7 27
5 Colorado 23.9 30
6 Connecticut 25.6 26
7 Delaware 23.8 17
8 District of columbia 23.6 32
9 Florida 19.9 66
In [88]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [95]:
In [96]:
In [97]:
In [98]:
In [99]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   state_sat18                     51 non-null     object 
 1   erw_sat18                       51 non-null     float64
 2   math_sat18                      51 non-null     float64
 3   total_sat18                     51 non-null     float64
 4   participation_percentage_sat18  51 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
In [100]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   state_act18                     51 non-null     object 
 1   composite_act18                 51 non-null     float64
 2   participation_percentage_act18  51 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1.3+ KB
In [101]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   state_sat19                     51 non-null     object 
 1   erw_sat19                       51 non-null     float64
 2   math_sat19                      51 non-null     float64
 3   total_sat19                     51 non-null     float64
 4   participation_percentage_sat19  51 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB
In [102]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   state_act19                     51 non-null     object 
 1   composite_act19                 51 non-null     float64
 2   participation_percentage_act19  51 non-null     float64
dtypes: float64(2), object(1)
memory usage: 1.6+ KB
In [104]:

Before we commit - state 'District Of Columbia' needs to be consistent across all dataframe

In [85]:
In [86]:
In [105]:

Combine your 2017 and 2018 data into a single dataframe

Joining on state names should work, assuming you formatted all your state names identically. Make sure none of your columns (other than state) have identical names. Do yourself a favor and decide if you're encoding participation rates as floats or integers and standardize this across your datasets.

Save the contents of this merged dataframe as final.csv.

Use this combined dataframe for the remainder of the project.

In [106]:
In [107]:
In [108]:
Out[108]:
states erw_sat17 math_sat17 total_sat17 participation_percentage_sat17 english_act17 math_act17 reading_act17 science_act17 composite_act17 participation_percentage_act17
In [112]:
Out[112]:
(51, 11)
In [113]:
Out[113]:
(51, 7)
In [114]:
Out[114]:
(51, 7)
In [109]:
In [110]:
Out[110]:
states erw_sat17 math_sat17 total_sat17 participation_percentage_sat17 english_act17 math_act17 reading_act17 science_act17 composite_act17 ... total_sat18 participation_percentage_sat18 composite_act18 participation_percentage_act18 erw_sat19 math_sat19 total_sat19 participation_percentage_sat19 composite_act19 participation_percentage_act19
0 Alabama 593.0 572.0 1165.0 5.0 18.9 18.4 19.7 19.4 19.2 ... 1166.0 6.0 19.1 100.0 583.0 560.0 1143.0 7.0 18.9 100.0
1 Alaska 547.0 533.0 1080.0 38.0 18.7 19.8 20.4 19.9 19.8 ... 1106.0 43.0 20.8 33.0 556.0 541.0 1097.0 41.0 20.1 38.0
2 Arizona 563.0 553.0 1116.0 30.0 18.6 19.8 20.1 19.8 19.7 ... 1149.0 29.0 19.2 66.0 569.0 565.0 1134.0 31.0 19.0 73.0

3 rows × 23 columns

In [115]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 23 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   states                          51 non-null     object 
 1   erw_sat17                       51 non-null     float64
 2   math_sat17                      51 non-null     float64
 3   total_sat17                     51 non-null     float64
 4   participation_percentage_sat17  51 non-null     float64
 5   english_act17                   51 non-null     float64
 6   math_act17                      51 non-null     float64
 7   reading_act17                   51 non-null     float64
 8   science_act17                   51 non-null     float64
 9   composite_act17                 51 non-null     float64
 10  participation_percentage_act17  51 non-null     float64
 11  erw_sat18                       51 non-null     float64
 12  math_sat18                      51 non-null     float64
 13  total_sat18                     51 non-null     float64
 14  participation_percentage_sat18  51 non-null     float64
 15  composite_act18                 51 non-null     float64
 16  participation_percentage_act18  51 non-null     float64
 17  erw_sat19                       51 non-null     float64
 18  math_sat19                      51 non-null     float64
 19  total_sat19                     51 non-null     float64
 20  participation_percentage_sat19  51 non-null     float64
 21  composite_act19                 51 non-null     float64
 22  participation_percentage_act19  51 non-null     float64
dtypes: float64(22), object(1)
memory usage: 9.6+ KB
In [116]:
In [117]:
In [118]:
Out[118]:
Index(['states', 'erw_sat17', 'math_sat17', 'total_sat17',
       'participation_percentage_sat17', 'composite_act17',
       'participation_percentage_act17', 'erw_sat18', 'math_sat18',
       'total_sat18', 'participation_percentage_sat18', 'composite_act18',
       'participation_percentage_act18', 'erw_sat19', 'math_sat19',
       'total_sat19', 'participation_percentage_sat19', 'composite_act19',
       'participation_percentage_act19'],
      dtype='object')
In [119]:
Out[119]:
states sat17_erw sat17_math sat17_total sat17_participation_percent act17_total act17_participation_percent sat18_erw sat18_math sat18_total sat18_participation_percent act18_total act18_participation_percent sat19_erw sat19_math sat19_total sat19_participation_percent act19_total act19_participation_percent
0 Alabama 593.0 572.0 1165.0 5.0 19.2 100.0 595.0 571.0 1166.0 6.0 19.1 100.0 583.0 560.0 1143.0 7.0 18.9 100.0
1 Alaska 547.0 533.0 1080.0 38.0 19.8 65.0 562.0 544.0 1106.0 43.0 20.8 33.0 556.0 541.0 1097.0 41.0 20.1 38.0
2 Arizona 563.0 553.0 1116.0 30.0 19.7 62.0 577.0 572.0 1149.0 29.0 19.2 66.0 569.0 565.0 1134.0 31.0 19.0 73.0
3 Arkansas 614.0 594.0 1208.0 3.0 19.4 100.0 592.0 576.0 1169.0 5.0 19.4 100.0 582.0 559.0 1141.0 6.0 19.3 100.0
4 California 531.0 524.0 1055.0 53.0 22.8 31.0 540.0 536.0 1076.0 60.0 22.7 27.0 534.0 531.0 1065.0 63.0 22.6 23.0
In [120]:
In [121]:
Out[121]:
states sat17_erw sat18_erw sat19_erw sat17_math sat18_math sat19_math sat17_total sat18_total sat19_total sat17_participation_percent sat18_participation_percent sat19_participation_percent act17_total act18_total act19_total act17_participation_percent act18_participation_percent act19_participation_percent
46 Virginia 561.0 567.0 567.0 541.0 550.0 551.0 1102.0 1117.0 1119.0 65.0 68.0 68.0 23.8 23.9 24.0 29.0 24.0 21.0
47 Washington 541.0 543.0 539.0 534.0 538.0 535.0 1075.0 1081.0 1074.0 64.0 69.0 70.0 21.9 22.2 22.1 29.0 24.0 24.0
48 West Virginia 558.0 513.0 483.0 528.0 486.0 460.0 1086.0 999.0 943.0 14.0 28.0 99.0 20.4 20.3 20.8 69.0 65.0 49.0
49 Wisconsin 642.0 641.0 635.0 649.0 653.0 648.0 1291.0 1294.0 1283.0 3.0 3.0 3.0 20.5 20.5 20.3 100.0 100.0 100.0
50 Wyoming 626.0 633.0 623.0 604.0 625.0 615.0 1230.0 1257.0 1238.0 3.0 3.0 3.0 20.2 20.0 19.8 100.0 100.0 100.0

Save the file - 2017-2018-2019 (SAT + ACT) merged - final.csv

In [123]:

Exploratory Data Analysis

Summary Statistics

Transpose the output of pandas describe method to create a quick overview of each numeric feature.

In [125]:
In [126]:
In [127]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   states                       51 non-null     object 
 1   sat17_erw                    51 non-null     float64
 2   sat18_erw                    51 non-null     float64
 3   sat19_erw                    51 non-null     float64
 4   sat17_math                   51 non-null     float64
 5   sat18_math                   51 non-null     float64
 6   sat19_math                   51 non-null     float64
 7   sat17_total                  51 non-null     float64
 8   sat18_total                  51 non-null     float64
 9   sat19_total                  51 non-null     float64
 10  sat17_participation_percent  51 non-null     float64
 11  sat18_participation_percent  51 non-null     float64
 12  sat19_participation_percent  51 non-null     float64
 13  act17_total                  51 non-null     float64
 14  act18_total                  51 non-null     float64
 15  act19_total                  51 non-null     float64
 16  act17_participation_percent  51 non-null     float64
 17  act18_participation_percent  51 non-null     float64
 18  act19_participation_percent  51 non-null     float64
dtypes: float64(18), object(1)
memory usage: 7.7+ KB
In [128]:
In [129]:
Out[129]:
count mean std min 25% 50% 75% max
sat17_erw 51.0 569.117647 45.666901 482.0 533.50 559.0 613.00 644.0
sat18_erw 51.0 563.686275 47.502627 480.0 534.50 552.0 610.50 643.0
sat19_erw 51.0 560.803922 46.248035 483.0 527.50 550.0 612.00 636.0
sat17_math 51.0 547.627451 84.909119 52.0 522.00 548.0 599.00 651.0
sat18_math 51.0 556.235294 47.772623 480.0 522.50 544.0 593.50 655.0
sat19_math 51.0 552.196078 51.290553 460.0 511.50 546.0 599.00 648.0
sat17_total 51.0 1126.098039 92.494812 950.0 1055.50 1107.0 1212.00 1295.0
sat18_total 51.0 1120.019608 94.155083 977.0 1057.50 1098.0 1204.00 1298.0
sat19_total 51.0 1113.078431 97.197910 943.0 1038.00 1097.0 1210.00 1284.0
sat17_participation_percent 51.0 39.803922 35.276632 2.0 4.00 38.0 66.00 100.0
sat18_participation_percent 51.0 45.745098 37.314256 2.0 4.50 52.0 77.50 100.0
sat19_participation_percent 51.0 49.058824 38.747600 2.0 5.50 54.0 82.00 100.0
act17_total 51.0 21.519608 2.020695 17.8 19.80 21.4 23.60 25.5
act18_total 51.0 21.496078 2.111583 17.7 19.95 21.3 23.65 25.6
act19_total 51.0 21.464706 2.194523 17.9 19.80 21.1 23.70 25.5
act17_participation_percent 51.0 65.254902 32.140842 8.0 31.00 69.0 100.00 100.0
act18_participation_percent 51.0 61.725490 34.037085 7.0 28.50 66.0 100.00 100.0
act19_participation_percent 51.0 58.666667 34.515890 6.0 24.50 54.0 100.00 100.0
In [138]:
84.90911865855486 47.77262322095955

Manually calculate standard deviation

σ=1ni=1n(xiμ)2

  • Write a function to calculate standard deviation using the formula above
In [139]:
In [140]:
Out[140]:
2.001
In [141]:
Out[141]:
2.001
  • Use a dictionary comprehension to apply your standard deviation function to each numeric column in the dataframe. No loops
  • Assign the output to variable sd as a dictionary where:
    • Each column name is now a key
    • That standard deviation of the column is the value

Example Output : {'ACT_Math': 120, 'ACT_Reading': 120, ...}

In [142]:
Out[142]:
(51, 19)
In [143]:
In [144]:
Out[144]:
(51, 18)
In [145]:
Out[145]:
2.001
In [146]:
Out[146]:
{'sat17_erw': 45.217,
 'sat18_erw': 47.035,
 'sat19_erw': 45.792,
 'sat17_math': 84.073,
 'sat18_math': 47.302,
 'sat19_math': 50.785,
 'sat17_total': 91.584,
 'sat18_total': 93.227,
 'sat19_total': 96.24,
 'sat17_participation_percent': 34.929,
 'sat18_participation_percent': 36.947,
 'sat19_participation_percent': 38.366,
 'act17_total': 2.001,
 'act18_total': 2.091,
 'act19_total': 2.173,
 'act17_participation_percent': 31.824,
 'act18_participation_percent': 33.702,
 'act19_participation_percent': 34.176}
In [147]:
Out[147]:
{'sat17_erw': 45.217,
 'sat18_erw': 47.035,
 'sat19_erw': 45.792,
 'sat17_math': 84.073,
 'sat18_math': 47.302,
 'sat19_math': 50.785,
 'sat17_total': 91.584,
 'sat18_total': 93.227,
 'sat19_total': 96.24,
 'sat17_participation_percent': 34.929,
 'sat18_participation_percent': 36.947,
 'sat19_participation_percent': 38.366,
 'act17_total': 2.001,
 'act18_total': 2.091,
 'act19_total': 2.173,
 'act17_participation_percent': 31.824,
 'act18_participation_percent': 33.702,
 'act19_participation_percent': 34.176}
In [148]:

Do your manually calculated standard deviations match up with the output from pandas describe? What about numpy's std method?

Answer:

The manually calculated standard deviations are preety close to the output from pandas 'describe' method. While with numpy's 'std' method, it matches perfectly.

Using sorting and/or masking (along with the .head method to not print our entire dataframe), consider the following questions:

  • Which states have the highest and lowest participation rates for the:
    • 2017 SAT?
    • 2018 SAT?
    • 2017 ACT?
    • 2018 ACT?
  • Which states have the highest and lowest mean total/composite scores for the:
    • 2017 SAT?
    • 2018 SAT?
    • 2017 ACT?
    • 2018 ACT?
  • Do any states with 100% participation on a given test have a rate change year-to-year?
  • Do any states show have >50% participation on both tests either year?

Based on what you've just observed, have you identified any states that you're especially interested in? Make a note of these and state why you think they're interesting.

You should comment on your findings at each step in a markdown cell below your code block. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.

In [149]:
Out[149]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
0 5.0 6.0 7.0
1 38.0 43.0 41.0
2 30.0 29.0 31.0
3 3.0 5.0 6.0
4 53.0 60.0 63.0
In [150]:
In [151]:
Out[151]:
644.0

Max_Min Participation (%) for each year

In [152]:
Out[152]:
sat17_total sat17_participation_percent
states
Iowa 1275.0 2.0
Mississippi 1242.0 2.0
North Dakota 1256.0 2.0
Connecticut 1041.0 100.0
Delaware 996.0 100.0
District Of Columbia 950.0 100.0
Michigan 1005.0 100.0
In [153]:
Out[153]:
sat18_total sat18_participation_percent
states
North Dakota 1283.0 2.0
Colorado 1025.0 100.0
Connecticut 1053.0 100.0
Delaware 998.0 100.0
Idaho 1001.0 100.0
Michigan 1011.0 100.0
In [154]:
Out[154]:
sat19_total sat19_participation_percent
states
North Dakota 1263.0 2.0
Colorado 1024.0 100.0
Connecticut 1046.0 100.0
Delaware 985.0 100.0
Florida 999.0 100.0
Idaho 993.0 100.0
Illinois 1013.0 100.0
Michigan 1003.0 100.0
Rhode Island 995.0 100.0
In [155]:
Out[155]:
act17_total act17_participation_percent
states
Maine 24.3 8.0
Alabama 19.2 100.0
Utah 20.3 100.0
Tennessee 19.8 100.0
South Carolina 18.7 100.0
Oklahoma 19.4 100.0
North Carolina 19.1 100.0
Nevada 17.8 100.0
Montana 20.3 100.0
Missouri 20.4 100.0
Mississippi 18.6 100.0
Minnesota 21.5 100.0
Louisiana 19.5 100.0
Kentucky 20.0 100.0
Colorado 20.8 100.0
Arkansas 19.4 100.0
Wisconsin 20.5 100.0
Wyoming 20.2 100.0
In [156]:
Out[156]:
act18_total act18_participation_percent
states
Maine 24.0 7.0
Alabama 19.1 100.0
Utah 20.4 100.0
Tennessee 19.6 100.0
South Carolina 18.3 100.0
Oklahoma 19.3 100.0
Ohio 20.3 100.0
North Carolina 19.1 100.0
Nevada 17.7 100.0
Nebraska 20.1 100.0
Montana 20.0 100.0
Missouri 20.0 100.0
Mississippi 18.6 100.0
Louisiana 19.2 100.0
Kentucky 20.2 100.0
Arkansas 19.4 100.0
Wisconsin 20.5 100.0
Wyoming 20.0 100.0
In [157]:
Out[157]:
act19_total act19_participation_percent
states
Maine 24.3 6.0
Alabama 18.9 100.0
Arkansas 19.3 100.0
Kentucky 19.8 100.0
Louisiana 18.8 100.0
Mississippi 18.4 100.0
Montana 19.8 100.0
Nebraska 20.0 100.0
Nevada 17.9 100.0
North Carolina 19.0 100.0
Ohio 20.0 100.0
Oklahoma 18.9 100.0
Tennessee 19.4 100.0
Utah 20.3 100.0
Wisconsin 20.3 100.0
Wyoming 19.8 100.0

Max_Min Total/Composite Scores

In [159]:
Out[159]:
sat17_total sat17_participation_percent
states
District Of Columbia 950.0 100.0
Minnesota 1295.0 3.0
In [160]:
Out[160]:
sat18_total sat18_participation_percent
states
District Of Columbia 977.0 92.0
Minnesota 1298.0 4.0
In [161]:
Out[161]:
sat19_total sat19_participation_percent
states
West Virginia 943.0 99.0
Minnesota 1284.0 4.0
In [162]:
Out[162]:
act17_total act17_participation_percent
states
Nevada 17.8 100.0
New Hampshire 25.5 18.0
In [163]:
Out[163]:
act18_total act18_participation_percent
states
Nevada 17.7 100.0
Connecticut 25.6 26.0
In [164]:
Out[164]:
act19_total act19_participation_percent
states
Nevada 17.9 100.0
Connecticut 25.5 22.0
Massachusetts 25.5 21.0

First Impressions: SAT Participation Rate Is Getting Better!

In [302]:
Out[302]:
states sat17_participation_percent sat18_participation_percent sat19_participation_percent
6 Connecticut 100.0 100.0 100.0
7 Delaware 100.0 100.0 100.0
8 District Of Columbia 100.0 92.0 94.0
22 Michigan 100.0 100.0 100.0
12 Idaho 93.0 100.0 100.0
9 Florida 83.0 56.0 100.0
39 Rhode Island 71.0 97.0 100.0
5 Colorado 11.0 100.0 100.0
13 Illinois 9.0 99.0 100.0
In [293]:
Out[293]:
6
In [294]:
Out[294]:
9
In [292]:
Out[292]:
12
In [170]:
In [171]:
Out[171]:
states act17_participation_percent act18_participation_percent act19_participation_percent
0 Alabama 100.0 100.0 100.0
3 Arkansas 100.0 100.0 100.0
5 Colorado 100.0 30.0 27.0
17 Kentucky 100.0 100.0 100.0
18 Louisiana 100.0 100.0 100.0
23 Minnesota 100.0 99.0 95.0
24 Mississippi 100.0 100.0 100.0
25 Missouri 100.0 100.0 82.0
26 Montana 100.0 100.0 100.0
27 Nebraska 84.0 100.0 100.0
28 Nevada 100.0 100.0 100.0
33 North Carolina 100.0 100.0 100.0
35 Ohio 75.0 100.0 100.0
36 Oklahoma 100.0 100.0 100.0
40 South Carolina 100.0 100.0 78.0
42 Tennessee 100.0 100.0 100.0
44 Utah 100.0 100.0 100.0
49 Wisconsin 100.0 100.0 100.0
50 Wyoming 100.0 100.0 100.0

Do any states with 100% participation on a given test have a rate change year-to-year?

  • Yes, year-on-year rate change does exist for states with 100% participation.

  • for SAT's Colorado, Idaho, Illinois and Rhode Island went from lower participation in 2017 to 100% by 2019. While the District of Columbia, went down from 100% to 94% in the same duration. Florida, oddly went down from 845 to 56% but showed a significant increase of almost 100% between 2018 and 2019.

  • for ACT's Colorado had a significant drop from 100% in 2017 to 27% in 2019. Missouri and South Carolina exibit a downward trend for the same years. On the other hand a lot more participation is seen from the states of Ohio and Nebraksa every year.

  • to sum up, a lot more states happen to show participation in ACT's compared to SAT's as we go from 2017 to 2019.

In [172]:
Out[172]:
states sat17_participation_percent sat18_participation_percent sat19_participation_percent act17_participation_percent act18_participation_percent act19_participation_percent
9 Florida 83.0 56.0 100.0 73.0 66.0 54.0
10 Georgia 61.0 70.0 71.0 55.0 53.0 49.0
11 Hawaii 55.0 56.0 54.0 90.0 89.0 80.0
33 North Carolina 49.0 52.0 51.0 100.0 100.0 100.0
40 South Carolina 50.0 55.0 68.0 100.0 100.0 78.0

Do any states show have >50% participation on both tests either year?

  • Yes, the states of Florida, Georgia, Hawaii, North Carolina, South Carolina have an average of more than 50% participation for both the tests.

  • INTERESTING FACT: From 2017 to 2019, Florida shows about 20% increase in SAT participation, at the same time it shows 20% drop in ACT participation

Visualization

Visualize the data

There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is always a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers.

Some recommendations on plotting:

  • Plots have titles
  • Plots have axis labels
  • Plots have appropriate tick labels
  • All text is legible in a plot
  • Plots demonstrate meaningful and valid relationships
  • Plots are interpreted to aid understanding

There is such a thing as too many plots, and there are a lot of bad plots. You might make some! (But hopefully not with the guided prompts below).

Use Seaborn's heatmap with pandas .corr() to visualize correlations between all numeric features

Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. However, they can be extremely useful in identify relationships of potential interest (as well as identifying potential collinearity before modeling).

example:

sns.heatmap(df.corr())

Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).

In [173]:
In [174]:
In [181]:

Heatmap-1 for Total score for each year:

  • Strong positive correlation is seen between all SAT scores for different years, and even better correlation among ACT scores for different years.
  • While SAT and ACT scores have clear negative correlation, but not strong.
In [182]:

Heatmap-2 for Participation % for each year:

  • Very Strong negative Correlation between SAT and ACT participation rates - this tells us that different states prefer different testing systems.
In [184]:

HEATMAP-3 Total+Participation combined EXPLAINATION

  • Strong negative correlation between participation and total - this confirms smaller participation leads to higher total and viceversa

Define a custom function to subplot histograms

We have data for two tests for two years. We only have composite (and not subtest scores) for the 2018 ACT. We should write a function that will take the names of 2+ columns and subplot histograms. While you can use pandas plotting or Seaborn here, matplotlib gives you greater control over all aspects of your plots.

Helpful Link for Plotting Multiple Figures

Here's some starter code:

In [185]:

Plot and interpret histograms

For each of the following:

  • Participation rates for SAT & ACT
  • Math scores for SAT & ACT
  • Reading/verbal scores for SAT & ACT
In [186]:
In [272]:
In [188]:
Out[188]:
23
In [189]:
Out[189]:
1

PLOT#1 Participation rates for SAT & ACT

  1. For SAT Participation, more that 50% of the states have either very high or extremenly very low participation
  2. Negative correlation between SAT and ACT participation can be confirmed using this plot. States with 0-10% participation for SAT's and 90-100% participation for ACT's.
  3. High's and Low's for SAT's and ACT's clearly do not match. (Top 3 versus Bottom 3 plots)
In [190]:
In [195]:
/Users/samay20/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
In [194]:
Out[194]:
547.6274509803922
In [196]:
/Users/samay20/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

PLOT#2 Math Scores for SAT & ACT

  • SAT - Mean and Median fall almost at the same point, we can say that the scores are normally distributed (Mean=Median=548)
  • ACT - Similar to the plot for SAT scores, ACT scores are also normally distributed (mean=median=21)
In [200]:
/Users/samay20/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
In [201]:
/Users/samay20/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)
In [202]:
Out[202]:
22.013725490196077

PLOT#3 Reading/verbal Scores for SAT & ACT

  • SAT - for ERW Score, Mean > Median, Therefore we can say the distribution is positively skewed. (almost 2 modes)
  • ACT - The scores are normally distributed (Mean=Median=22)

Plot and interpret scatter plots

For each of the following:

  • SAT vs. ACT math scores for 2017
  • SAT vs. ACT verbal/reading scores for 2017
  • SAT vs. ACT total/composite scores for 2017
  • Total scores for SAT 2017 vs. 2018
  • Composite scores for ACT 2017 vs. 2018

Plot the two variables against each other using matplotlib or Seaborn

Your plots should show:

  • Two clearly labeled axes
  • A proper title
  • Using colors and symbols that are clear and unmistakable

Feel free to write a custom function, and subplot if you'd like. Functions save both time and space.

In [203]:
In [204]:
In [205]:
In [208]:
<Figure size 720x720 with 0 Axes>
In [207]:
Out[207]:
51

Scatterplot for Plot# 1 (Math Scores SAT vs ACT, 2017)

  • Removed the outlier for 'math_sat17' column.
  • Little negative collinearity, but no clear relationship among data
  • I can see 3 groups of scorers,
    o 1. states with very high average ACT_Math score and very low average SAT_Math scores
    o 2. states with very low average ACT_Math score and very high average SAT_Math scores
    o 3. states with average scores for both the tests
    o This kind of backsup our assumption, probably the scores are directly related to states participation.
In [211]:
<Figure size 720x720 with 0 Axes>

Scatterplot for Plot# 2 (Verbal/Reading Scores SAT vs ACT, 2017)

  • Similar to Math scores, little negative collinearity, but no clear relationship among data
  • I can see 3 groups of scorers,
    o 1. states with very high average ACT_reading score and very low average SAT_erw scores
    o 2. states with very low average ACT_reading score and very high average SAT_erw scores
    o 3. states with average scores for both the tests
    o This kind of backsup our assumption, probably the scores are directly related to states participation.
In [212]:
<Figure size 720x720 with 0 Axes>

Scatterplot for Plot# 3 (Total/Composite Scores SAT vs ACT, 2017)

  • Similar to Math/Reading scores, little negative collinearity, but no clear relationship among data
  • I can see 3 groups of scorers,
    o 1. states with very high average ACT_Composite score and very low average SAT_Total scores
    o 2. states with very low average ACT_Composite score and very high average SAT_Total scores
    o 3. states with average scores for both the tests
    o This kind of backsup our assumption, probably the scores are directly related to states participation.
In [213]:
In [214]:
In [215]:
In [217]:
In [218]:
Out[218]:
sat17_total sat18_total sat19_total act17_total act18_total act19_total
sat17_total 1.000000 0.852727 0.900690 -0.445020 -0.421317 -0.444732
sat18_total 0.852727 1.000000 0.905423 -0.357463 -0.412421 -0.444598
sat19_total 0.900690 0.905423 1.000000 -0.341370 -0.390046 -0.424782
act17_total -0.445020 -0.357463 -0.341370 1.000000 0.941365 0.934117
act18_total -0.421317 -0.412421 -0.390046 0.941365 1.000000 0.991787
act19_total -0.444732 -0.444598 -0.424782 0.934117 0.991787 1.000000

Scatterplot for Plot# 4 (Includes Total Comparisons between, Total scores in SAT, Composite Scores in ACT and SAT/ACT Participation

  • SAT Total Scores: 2017 vs 2018 vs 2019 - strong positive correlation, tells us that the states are consistent with the SAT scores
  • ACT Composite Scores: 2017 vs 2018 vs 2019 - strong positive correlation, tells us that the states are consistent with the ACT scores
  • SAT Vs ACT Participation: 2017 vs 2018 vs 2019 - clear negative correlation, tells us that the states with higher SAT participation have lower ACT participation rate and viceversa. Not many states participate in both the test. States include Florida, Georgia, Huwaii, North Carolina and South Carolina

Plot and interpret boxplots

For each numeric variable in the dataframe create a boxplot using Seaborn. Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.

Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).

Each boxplot should:

  • Only include variables of a similar scale
  • Have clear labels for each variable
  • Have appropriate titles and labels
In [219]:
In [222]:
In [221]:
Out[221]:
564.5359477124183

Box Plot# 1 (SAT-ERW)

  • Consistent scores across all 3 years, no outliers in the dataset
  • Maintains average grades across US Avg. Median ~ 553 while Avg. Mean Score ~ 5640
In [223]:
In [224]:
Out[224]:
546.0

Box Plot2 1 (SAT-Math)

  • More than 75% of the states score well above average for this section of SAT
  • Maintains average grades across US Avg. Median ~ 546 while Avg. Mean Score ~ 552 (Right Skew)
In [249]:
In [226]:
Out[226]:
1100.6666666666667

Box Plot# 3 (SAT-Total)

  • Quiet big interquartile range, but almost 25% of the states score well above the average.
  • Maintains average grades across US Avg. Median ~ 1100 while Avg. Mean Score ~ 1120 (Right Skew)

Participation Rate: Box Plots tells it all!

In [295]:
In [248]:
In [229]:
Out[229]:
17
In [230]:
Out[230]:
17
In [231]:
Out[231]:
61.882352941176464

Box Plot# 4 (SAT & ACT -Participation%)

  • Number of states with Avg. SAT participation > 70 is exactly same as states with Avg. ACT participation < 30
  • SAT - US Avg. Median ~ 48% while Avg. Mean Score ~ 44% (Left Skew)
  • ACT - US Avg. Median ~ 63% while Avg. Mean Score ~ 62% (Left Skew)
  • This tells us on Avg. atleast 60% of States participate in ACT while less than 50% of the States participate in SAT
  • Lastly, both of them show a consistently increasing participation rate
In [232]:
In [234]:
Out[234]:
<AxesSubplot:xlabel='act19_total', ylabel='Count'>
In [235]:
Out[235]:
21.1

Box Plot# 5 (ACT-Total)

  • Quiet big interquartile range, but almost 25% of the states score well above the average.
  • ACT Total: 2017 (Mean ~ Median ~ 21.5), 2018 (Mean ~ Median ~ 21.3) and 2019 (Mean ~ 21.4, Median ~ 21.1)
  • we can say that over the 3 year period Median dropped a bit, while Mean remaind steady.

Feel free to do additional plots below

(do research and choose your own chart types & variables)

Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. As always, make sure you're interpreting your plots as you go.

In [236]:
Out[236]:
sat17_erw sat18_erw sat19_erw sat17_math sat18_math sat19_math
sat17_erw 1.000000 0.814096 0.898352 0.628405 0.865909 0.876606
sat18_erw 0.814096 1.000000 0.865572 0.487915 0.954661 0.831921
sat19_erw 0.898352 0.865572 1.000000 0.562127 0.940199 0.986026
sat17_math 0.628405 0.487915 0.562127 1.000000 0.537222 0.564127
sat18_math 0.865909 0.954661 0.940199 0.537222 1.000000 0.933717
sat19_math 0.876606 0.831921 0.986026 0.564127 0.933717 1.000000
In [237]:
Out[237]:
<AxesSubplot:>

Interpretation

  • Math and ERW possess a very strong correlation for SAT's.

(Optional): Using Tableau, create a choropleth map for each variable using a map of the US.

Save this plot as an image file in an images directory, provide a relative path, and insert the image into notebook in markdown.

In [ ]:

Top 3 States based on Participation Rates for SAT/ACT across all 3 years

Based upon your observations, choose three states that demonstrate interesting trends in their SAT and/or ACT participation rates. Spend some time doing outside research on state policies that might influence these rates, and summarize your findings below. Feel free to go back and create new plots that highlight these states of interest. If you bring in any outside tables or charts, make sure you are explicit about having borrowed them. If you quote any text, make sure that it renders as being quoted. (Make sure that you cite your sources -- check with you local instructor for citation preferences).

In [296]:
In [241]:
Out[241]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
Illinois 9.0 99.0 100.0
In [242]:
Out[242]:
act17_participation_percent act18_participation_percent act19_participation_percent
states
Colorado 100.0 30.0 27.0
In [243]:
Out[243]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
Hawaii 55.0 56.0 54.0
In [244]:
Out[244]:
act17_participation_percent act18_participation_percent act19_participation_percent
states
Hawaii 90.0 89.0 80.0
In [245]:
Out[245]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
Rhode Island 71.0 97.0 100.0
In [246]:
Out[246]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
District Of Columbia 100.0 92.0 94.0

My Choice, top 3 states

Top States that demonstrate interesting trends are,

Top 3,

  • Illinois - Tremendous Growth in SAT participation
  • Colorado - Tremendous Drop in ACT participation
  • Hawaii - Best all rounder, based on balance between SAT and ACT participation
    • (Hawaii is a very good example of a State that we should focus on, ACT participation rates are constantly dropping, while SAT participation rates are almost the same, for 2017, 2018 and 2019.)

Other,

  • Rhode Island - Tremendous Growth in SAT participation
  • District of Columbia - Decline in SAT participation
    • (Another good State for the School Board to focus on should be District of Columbia, loosing already gained participation is always a matter of deep concern.)

Conclusions

Based on your exploration of the data, what are you key takeaways and recommendations? Choose one state with a lower participation rate and provide a suggestion for how the College Board might increase participation amongst graduating seniors in this state. Are there additional data you desire that would better inform your investigations?

In [250]:
In [229]:
Out[229]:
17
In [230]:
Out[230]:
17
In [231]:
Out[231]:
61.882352941176464
In [270]:
In [273]:
In [275]:
Out[275]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
Hawaii 55.0 56.0 54.0
In [244]:
Out[244]:
act17_participation_percent act18_participation_percent act19_participation_percent
states
Hawaii 90.0 89.0 80.0
In [276]:
Out[276]:
sat17_participation_percent sat18_participation_percent sat19_participation_percent
states
District Of Columbia 100.0 92.0 94.0

Final Statement - Conclusion

  • From the box plot above it's clear that on Avg. atleast 60% of States participate in ACT while less than 50% of the States participate in SAT.
  • The strong non-collinearity between Total Score and Participation Percentages is a matter of concern too.
  • SAT - Math and ERW scores possess strong colliearity
  • Most of the Distributions, specially various Participation Rates and the Total Scores are BiModal in nature i.e not distribued normally, possible factors like mandatory SAT/ACT tests classifies candidates into High Scoring and Low Scoring Candidates. Some have all the facilities and want to take the test, while the other just have to.

Recommendations

Recommendations

  • Investigate what strategies / implementations went right for States with increasing participation rates. This includes States like New York, Maryland, New Jersey, Maine, West Virginia, Colorado, Florida, Idaho, Illinois and Rhode Island. Like this one SAT scores drop for 2019 class, but participation rises through testing in schools
  • Research show giving 'FREE' SAT tests is a key strategy and should be applied for all states.
  • Focusing more on States like Hawaii might help.
    • Hawaii displayed alsmost same Participation % for 3 years (55% -> 56% -> 54%) for SAT, while the ACT participation rates have dropped from 90% in 2017 to 80% in 2019, unlike other states where the participation rates usually transfer from ACT to SAT and vice-versa.
  • District of Columbia, is another example of a state where SAT participation rates dropped from 100% in 2017 to 94% in 2019. We should definitely get some insights on what went wrong.

Based on Research

  • Previous changes like Dropping Essay Requirement, changing Exam Duration to 3hrs and engage into contracts with states and schools to make SAT's affordable are all in right direction. We should expand these ideas to other states with low SAT scores too. DataFrame right underneath this section backs this statement. Reference - 2016 regulations
In [283]:
Out[283]:
states sat17_participation_percent sat18_participation_percent sat19_participation_percent
6 Connecticut 100.0 100.0 100.0
7 Delaware 100.0 100.0 100.0
8 District Of Columbia 100.0 92.0 94.0
22 Michigan 100.0 100.0 100.0
12 Idaho 93.0 100.0 100.0
9 Florida 83.0 56.0 100.0
39 Rhode Island 71.0 97.0 100.0
5 Colorado 11.0 100.0 100.0
13 Illinois 9.0 99.0 100.0

What Next!

Scope of this Project

  • I would like to consider different factors outside of this dataset including but not limited to Racial Impact SAT's Racial Impact

Bonus: Descriptive and Inferential Statistics

Summarizing Distributions

Above, we used pandas describe to provide quick summary statistics of our numeric columns. We also demonstrated many visual relationships.

As data scientists, having a complete understanding of data is imperative prior to modeling.

While we will continue to build our analytic tools, we know that measures of central tendency, spread, and shape/skewness provide a quick summary of distributions.

For each variable in your data, summarize the underlying distributions (in words & statistics)

  • Be thorough in your verbal description of these distributions.
  • Be sure to back up these summaries with statistics.

Answers:

We generally assuming that data we sample from a population will be normally distributed. Do we observe this trend?

Answer:

Does This Assumption Hold for: - Math - Reading - Rates Explain your answers for each distribution and how you think this will affect estimates made from these data.

Answer:

Estimate Limits of Data

Suppose we only seek to understand the relationship between SAT and ACT participation rates in 2017.

Does it make sense to conduct statistical inference given these data specifically?

Why or why not?

(think about granularity, aggregation, the relationships between populations size & rates...consider the actually populations these data describe in answering this question)

Answer:

Is it appropriate to compare these specific SAT and ACT math scores?

Why or why not?

Answer:

Statistical Evaluation of Distributions

If you feel it's appropriate, using methods we discussed in class, run hypothesis tests to compare variables of interest in our dataset.

In [ ]: